Skip to content

feat: split out @databricks/lakebase-auth with eager refresh and retry#461

Draft
jawj wants to merge 9 commits into
mainfrom
lakebase-auth
Draft

feat: split out @databricks/lakebase-auth with eager refresh and retry#461
jawj wants to merge 9 commits into
mainfrom
lakebase-auth

Conversation

@jawj

@jawj jawj commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

Extract OAuth credential/token-refresh logic from @databricks/lakebase.

Move these to a new, lightweight @databricks/lakebase-auth package.

Add eager (default) and lazy refresh, plus retry on transient failures.

Extract OAuth credential/token-refresh logic from @databricks/lakebase.

Move these to a new, lightweight @databricks/lakebase-auth package.

Add eager (default) and lazy refresh, plus retry on transient failures.

Signed-off-by: George MacKerron <georgemackerron@neon.tech>
@github-actions

Copy link
Copy Markdown
Contributor

🔬  Run evals on this PR  ·  Go to Evals Monitor →

@pkosiec pkosiec left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you George for your contribution - this looks awesome!

Took a quick look (didn't test) but my general question would be: does it make sense to introduce a separate package right now? It brings more maintenance burden and I'm not sure if the benefits outweighs the increased complexity.

I'd vote for making all the exports part of the Lakebase package directly, and later on we could split it if there will be such requests from the community. What do you think?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a half of the changes required for the release - we'll need to:

  • do the second part in the secure release repo (polling + actual release)
  • create and configure the new package on npm side (with Fabian)
  • allowlist the package in both internal npm proxies

I can take care of those items after my OOO 👍

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the reasons to extract this as a separate package already? Is it only about the size or do you have other concerns?

I totally get the idea of exporting such useful auth-related helpers - but wouldn't it be easier if we did it as a part of the @databricks/lakebase package from the maintenance point of view?

@pkosiec pkosiec left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review — @databricks/lakebase-auth split

Automated multi-reviewer pass (correctness, security, adversarial, reliability, api-contract, testing, maintainability, project-standards). Load-bearing findings were verified against source. No P0/P1 code defects. 11 inline comments below (P2/P3). The DCO sign-off issue is intentionally excluded per request.

Priority order

  1. Eager-timer leakgetLakebasePgConfig/getLakebaseOrmConfig drop dispose, leaking an unstoppable background refresh timer (regression). (pool-config.ts:166)
  2. SSL prefer breaking change — now verifies certs; document + bump @databricks/lakebase minor. (config.ts:46)
  3. Retry classification + min-refresh floor — together they prevent credential-API storms on misconfig/clock-skew. (retry.ts, caching.ts)
  4. Release pin/ordering + knip (below).
  5. Span leak, over-wide exports, lazy-mode token discard, and the P3s as follow-ups.

CI / template (looks good)

  • pack:prerelease packs all four @databricks/* tarballs (lakebase-auth has tarball:prerelease), and prepare-template-artifact.ts copies the lakebase-auth tarball + sets its overrides entry — so the PR template artifact resolves the new package.
  • prepare-release-lakebase-auth.yml correctly mirrors the lakebase two-stage workflow (build → dist → SBOM → pack → SHA256 → upload; no publish/tag/push).
  • New package meets the CLAUDE.md "Creating New Packages" bar (build:package/build:watch, extends root tsconfig); typecheck wiring already fixed on this branch.

One CI gap with no diff line to anchor (#9)

knip.json does not exempt packages/lakebase-auth, while it does exempt packages/lakebase (ignoreWorkspaces). So pnpm run knip (CI lint job) analyzes the new package, and combined with the over-wide exports (see the index.ts comment) it may fail. Run pnpm knip locally; either add lakebase-auth to ignoreWorkspaces for consistency, or trim the exports.

Testing gaps (suggestions, not gates)

  • queryWithTelemetry wrapper (pool.ts:90-148) is untested — only pool.query.name is asserted; the promise/callback branches, error-span recording, queryDuration, and db.rows_affected are never exercised.
  • Eager-mode in-flight dedup is untested (tests flush the initial fetch before getToken()).
  • No test that eager-mode getLakebasePgConfig/getLakebaseOrmConfig avoid leaking a timer; no test that withRetries does not retry a non-transient error; no negative assertion that tokens never appear in logs/errors; no near/at-expiry or NaN-expiry scheduling test.

Residual risks (non-blocking)

  • sslMode: "disable" → no TLS, token sent in cleartext (opt-in, default verify-full); worth a doc warning.
  • No wall-clock timeout on the credential fetch (SDK socket ~5s is the only bound; worst-case retry time can exceed pg.Pool connectionTimeoutMillis).
  • getUsernameWithApiLookup swallows all errors → silent undefined username.
  • getLakebaseOrmConfig drops serverName when narrowing ssl → Bun.SQL via the ORM path loses SNI.

idleTimeoutMillis: poolConfig.idleTimeoutMillis,
connectionTimeoutMillis: poolConfig.connectionTimeoutMillis,
};
return buildLakebasePgConfig(config, telemetry, logger).poolConfig;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 (correctness, reliability) · confidence 100getLakebasePgConfig returns buildLakebasePgConfig(...).poolConfig and discards dispose; getLakebaseOrmConfig builds on it and exposes no dispose either. With the new mode: "eager" default, createPasswordProvider starts an unref'd background timer that re-fetches an OAuth credential roughly hourly forever, per call, with no way to stop it — a regression from the old lazy (timer-free) behaviour for the documented TypeORM/Sequelize path. createLakebasePool is unaffected (it wraps pool.enddispose).

Fix: default these two helpers to refresh: "lazy" (matches old behaviour), or return { ...config, dispose } and document calling it on shutdown.

@@ -44,36 +163,7 @@ export function getLakebasePgConfig(
telemetry?: DriverTelemetry,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 (maintainability) · confidence 100telemetry?/logger? are vestigial: every caller passes only config, yet they're part of the exported getLakebasePgConfig signature. Likewise DriverTelemetry and TokenRefreshDeps (the latter tied to the deprecated createTokenRefreshCallback) are re-exported from packages/lakebase/src/index.ts with no importer outside the package.

Fix: drop the dead params and trim those two exports.

new Date(expiresAt).toISOString(),
);
span.setStatus({ code: SpanStatusCode.OK });
span.end();

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 (reliability) · confidence 100span.end() only runs on the happy path. When generateDatabaseCredential throws, the catch at L79 rethrows without ending the span (raw OTel startActiveSpan does not auto-end on throw). Every failed refresh leaks an unclosed span.

Fix: try { … span.setStatus({code: OK}) } catch (e) { span.recordException(e); span.setStatus({code: ERROR}); throw e } finally { span.end() }.

for (const retryDelay of schedule) {
try {
return await fn();
} catch (err) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 (adversarial, reliability) · confidence 100withRetries retries on every thrown error. A non-transient failure (403/permission, ConfigurationError, ValidationError) is still retried across the full [50, 500, 5000]ms schedule (~5.5s) before surfacing — and in eager mode this compounds with the 30s background reschedule, so a misconfigured endpoint fires 4 doomed fetches every 30s for the process lifetime. The docs promise transient retries.

Fix: add a shouldRetry?: (err) => boolean predicate (default-true for back-compat) and fast-fail 4xx/config/validation errors.

case "verify-full": // since JS drivers check root certs, there's an implied sslrootcert=system
case "verify-ca": // upgraded to equivalent of verify-full, sslrootcert=system
case "require": // upgraded to equivalent of verify-full, sslrootcert=system
case "prefer": {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 (api-contract; security: improvement) · confidence 100prefer (and require) now fall through to { rejectUnauthorized: true }, and the default changed requireverify-full. Old @databricks/lakebase mapped prefer{ rejectUnauthorized: false }. This is a security improvement, but a silent breaking behaviour change: a deployment relying on prefer/PGSSLMODE=prefer against an unverifiable cert will now fail to connect.

Fix: add a BREAKING CHANGE changelog entry and bump @databricks/lakebase minor (0.3.0 → 0.4.0); a fix:/chore: commit only yields a patch.

BACKGROUND_RETRY_MS,
err instanceof Error ? err.message : String(err),
);
scheduleNext(BACKGROUND_RETRY_MS);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 (reliability) · confidence 75scheduleNext(BACKGROUND_RETRY_MS) reschedules a failed background refresh exactly 30s later with no jitter. Multiple pools/processes started together that hit the same outage all retry at the same instants (T+30s, T+60s…) — a synchronized thundering herd.

Fix: add jitter, e.g. BACKGROUND_RETRY_MS + Math.random() * BACKGROUND_RETRY_MS * 0.5.


return {
getToken() {
if (Date.now() > refreshAfter) cachedToken = refresh();

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 (adversarial) · confidence 80getToken() does cachedToken = refresh(), overwriting the cache with a possibly-rejecting promise while the previous token is still valid for up to earlyRefreshMs (default 2 min). A transient outage in that pre-expiry window makes lazy mode fail 100% of new connections even though a usable token was in hand (eager mode doesn't have this).

Fix: keep the last successfully-resolved token+expiry separately and serve it until actual expiry; only surface the rejection once no valid token remains.

getUsernameSync,
getUsernameWithApiLookup,
getWorkspaceClient,
mapSslConfig,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 (maintainability, api-contract) · confidence 100parseConfig, ParsedAuthConfig, mapSslConfig, getUsernameSync, withRetries, DEFAULT_RETRY_SCHEDULE are exported from the package entry but imported by no consumer outside this package. Once published they become committed public API (breaking to remove later).

Fix: trim the surface to getPgConfig, createPasswordProvider, generateDatabaseCredential, getUsernameWithApiLookup, errors and types; keep the rest as internal module exports. (Also addresses the knip note in the review summary.)

Comment thread tools/dist-lakebase.ts
`../packages/${pkgDirName}/package.json`,
);
const depPkg = JSON.parse(fs.readFileSync(depPkgPath, "utf-8"));
pkg.dependencies[depName] = `${depPkg.version}`;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2/P3 (api-contract, project-standards) · confidence 75/70 — two issues on this version pin:

  • Exact pin + release ordering: lakebase's workspace:* dep on @databricks/lakebase-auth is rewritten to the exact version. The two prepare-release-* workflows are independent and path-triggered, so nothing guarantees lakebase-auth@0.1.0 publishes before the lakebase release that requires it (first publish breaks otherwise); the exact pin also blocks lakebase-auth patch fixes from reaching lakebase consumers without re-releasing lakebase. Fix: emit `^${depPkg.version}` and coordinate lakebase-auth-first on the initial publish.
  • Prerelease mismatch: in a prerelease build the lakebase-auth tarball is 0.1.0-pr.<sha> but this still pins plain 0.1.0. It only resolves in the template because prepare-template-artifact.ts adds an overrides entry (matched by name); installing the lakebase prerelease tarball directly would try to resolve 0.1.0 from the registry → stale/missing.

const provider = cachedWithTimedRefresh(fetch, EARLY, onLog);
await vi.advanceTimersByTimeAsync(0); // initial attempt fails
expect(fetch).toHaveBeenCalledTimes(1);
expect(onLog).toHaveBeenCalled();

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 (testing) · confidence 95expect(onLog).toHaveBeenCalled() only proves the function ran; it won't catch a wrong level, format string, or args. (Same pattern at pool-config.test.ts:193 with logger.error.) Compare retry.test.ts, which asserts all four args.

Fix: expect(onLog).toHaveBeenCalledWith('warn', expect.stringContaining('retrying'), 30000, expect.stringContaining('boom')).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants